Get llm_filter to support document structure + similarity sorting for elements #876

Merged
merged 5 commits into from
Oct 4, 2024

Contributor

@baitsguy baitsguy commented Oct 4, 2024

1. Support the document model in llm_filter, plus some performance improvements
With use_elements=True, llm_filter now works on a DocSet made of original Documents (i.e. in the pre-exploded state).

  • Element iteration "early stopping": a record (document) passes the filter as soon as any of its elements passes the llm_filter condition, so we don't evaluate every element. This helps for documents that are a hit, but not for documents that are not (we still iterate over every element).
  • Optionally, elements are sorted by a similarity score so that potential LLM hits are reached sooner.
  • A keep_none flag dictates the behavior when a property value is missing.

None of these are perfect, but they align with the use cases we have so far, and we can keep iterating.
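To make the three behaviors above concrete, here is a minimal standalone sketch. It is not the code from this PR: the `Element` and `Document` classes, the `document_passes` function, and the `llm_condition` and `similarity` callables are hypothetical stand-ins, and treating a missing element value as a pass when `keep_none` is set is an assumption about how the flag behaves.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Element:
    # Hypothetical stand-in for a document element; text may be missing.
    text: Optional[str]


@dataclass
class Document:
    # Hypothetical stand-in for a pre-exploded document.
    elements: List[Element] = field(default_factory=list)


def document_passes(
    doc: Document,
    llm_condition: Callable[[str], bool],
    similarity: Optional[Callable[[str], float]] = None,
    keep_none: bool = False,
) -> bool:
    """Return True as soon as one element satisfies llm_condition."""
    elements = doc.elements
    if similarity is not None:
        # Evaluate the most similar elements first so that likely hits
        # are reached sooner. Elements without text sort last.
        elements = sorted(
            elements,
            key=lambda e: similarity(e.text) if e.text is not None else float("-inf"),
            reverse=True,
        )
    for element in elements:
        if element.text is None:
            # Assumption: keep_none decides whether a missing value passes.
            if keep_none:
                return True
            continue
        if llm_condition(element.text):
            return True  # early stop: remaining elements are not evaluated
    # A non-matching document still costs one evaluation per element.
    return False
```

In this sketch the similarity sort only changes the order of evaluation, so it reduces LLM calls for documents that match and has no effect on the cost of documents that do not.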

2. OpenSearchReader returns sorted elements in reconstructed documents
The element.element_index field is used to sort elements when reconstructing a document. This does an explicit sort after each reconstruction, which is okay given the magnitude of the data.
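As a rough illustration of the reconstruction step, the sketch below groups element records by parent document and then sorts each group by `element_index`. It is not the OpenSearchReader code: the `reconstruct_documents` function, the flat dict records, and the `parent_id` key are hypothetical, and placing elements without an index at the end is an assumption.

```python
from collections import defaultdict
from typing import Any, Dict, List


def reconstruct_documents(hits: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group element records by parent document, then order each group."""
    docs: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for hit in hits:
        docs[hit["parent_id"]].append(hit)

    for elements in docs.values():
        # Explicit sort after each reconstruction. Elements that lack an
        # index keep their relative order and go last (list.sort is stable).
        elements.sort(
            key=lambda e: (e.get("element_index") is None, e.get("element_index") or 0)
        )
    return dict(docs)
```

Each sort is O(n log n) in the number of elements of one document, which matches the description's point that an explicit per-document sort is acceptable at this data size.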

Note: the ignore_doc_structure flag is there to keep current usage from breaking. Once we update other code to use the document model, we can remove or flip it.

@baitsguy baitsguy marked this pull request as ready for review October 4, 2024 18:39
4 resolved review threads on lib/sycamore/sycamore/docset.py (2 outdated)
@baitsguy baitsguy requested a review from mdwelsh October 4, 2024 22:14
Collaborator

@mdwelsh mdwelsh left a comment

A few nits but LGTM

4 resolved review threads on lib/sycamore/sycamore/docset.py (3 outdated)
@baitsguy baitsguy enabled auto-merge (squash) October 4, 2024 23:37
@baitsguy baitsguy merged commit d3c4718 into main Oct 4, 2024
10 of 11 checks passed
2 participants